Stochastic Approximation for Non-Expansive Maps: Application to Q-Learning Algorithms

Authors

  • Jinane Abounadi
  • Dimitri P. Bertsekas
  • Vivek Borkar
Abstract

We discuss synchronous and asynchronous variants of fixed point iterations of the form $x_{k+1} = x_k + \gamma(k)\,\bigl(F(x_k, \xi_k) - x_k\bigr)$, where $F$ is a non-expansive mapping under a suitable norm, and $\{\xi_k\}$ is a stochastic sequence. These are stochastic approximation iterations that can be analyzed using the ODE approach based either on Kushner and Clark’s Lemma for the synchronous case or Borkar’s Theorem for the asynchronous case. However, the analysis requires that the iterates $\{x_k\}$ are bounded, a fact which is usually hard to prove. We develop a novel framework for proving boundedness, which is based on scaling ideas and properties of Lyapunov functions. We then combine the boundedness property with Borkar’s stability analysis of ODE’s involving non-expansive mappings to prove convergence with probability 1. We also apply our convergence analysis to Q-learning algorithms for stochastic shortest path problems and we are able to relax some of the assumptions of the currently available results.

Notes:

  • 1 Research supported by NSF under Grant 9600494-DMI. Thanks are due to John Tsitsiklis whose suggestions resulted in important simplifications of the lemmas in Section 2.
  • 2 Dept. of Electrical Engineering and Computer Science, M.I.T., Cambridge, Mass., 02139.
  • 3 School of Technology and Computer Science, Tata Institute of Fundamental Research, Homi Bhabha Road, Mumbai 400005, India.
  • 4 The research of V. Borkar was supported in part by the Homi Bhabha Fellowship, and Govt. of India, Dept. of Science and Technology grant No. III 5(12)/96-ET.
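The synchronous form of the iteration in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: the map `noisy_map`, the step size `step_size`, and the Gaussian noise are all hypothetical choices made for illustration. The map used here is a sup-norm contraction with additive noise, which is only a special case of a non-expansive mapping, so the example does not exercise the boundedness difficulty the paper addresses.

```python
import random


def noisy_map(x, xi):
    """Hypothetical F(x, xi): a sup-norm contraction (modulus 0.5) plus additive noise.

    The noise-free map has the fixed point (8/3, 10/3).
    """
    return [0.5 * x[1] + 1.0 + xi[0], 0.5 * x[0] + 2.0 + xi[1]]


def step_size(k):
    """Assumed diminishing step size gamma(k) = (k + 1) ** -0.75.

    The step sizes sum to infinity while their squares have a finite sum.
    """
    return (k + 1) ** -0.75


def run_iteration(num_steps=20000, seed=0):
    """Run x_{k+1} = x_k + gamma(k) * (F(x_k, xi_k) - x_k), updating all components."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for k in range(num_steps):
        # Draw the stochastic input xi_k (zero-mean Gaussian, an assumption).
        xi = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        fx = noisy_map(x, xi)
        g = step_size(k)
        # Synchronous update: every component moves toward F(x_k, xi_k).
        x = [x_i + g * (f_i - x_i) for x_i, f_i in zip(x, fx)]
    return x


if __name__ == "__main__":
    print(run_iteration())
```

Under these assumptions the iterates should settle near the fixed point of the noise-free map. An asynchronous variant would update only a subset of components at each step, which this sketch does not attempt.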

Similar resources

Stochastic approximation for non-expansive maps : application to Q-learning algorithms

We discuss synchronous and asynchronous iterations of the form $x_{k+1} = x_k + \gamma(k)\,\bigl(h(x_k) + w_k\bigr)$, where $h$ is a suitable map and $\{w_k\}$ is a deterministic or stochastic sequence satisfying suitable conditions. In particular, in the stochastic case, these are stochastic approximation iterations that can be analyzed using the ODE approach based either on Kushner and Clark’s lemma for the synchronous case or on...

Two-Timescale Q-Learning with an Application to Routing in Communication Networks

We propose two variants of the Q-learning algorithm that (both) use two timescales. One of these updates Q-values of all feasible state-action pairs at each instant while the other updates Q-values of states with actions chosen according to the ‘current’ randomized policy updates. A sketch of convergence of the algorithms is shown. Finally, numerical experiments using the proposed algorithms fo...

Empirical Q-Value Iteration

We propose a new simple and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov Decision Process (MDP) when the transition kernels are unknown. Unlike the classical learning algorithms for MDPs, such as Q-learning and ‘actor-critic’ algorithms, this algorithm doesn’t depend on a stochastic approximation-based method. We show that our algorithm, which we call ...

LIDS-P-2172: Asynchronous Stochastic Approximation and Q-Learning

We provide some general results on the convergence of a class of stochastic approximation algorithms and their parallel and asynchronous variants. We then use these results to study the Q-learning algorithm, a reinforcement learning method for solving Markov decision problems, and establish its convergence under conditions more general than previously available.

New algorithms of the Q-learning type

We propose two algorithms for Q-learning that use the two timescale stochastic approximation methodology. The first of these updates Q-values of all feasible state-action pairs at each instant while the second updates Q-values of states with actions chosen according to the ‘current’ randomized policy updates. A proof of convergence of the algorithms is shown. Finally, numerical experiments usin...

Publication date: 1998